library(reticulate)
Exercise 1 (Wilke on visualizing amounts) Read Chapter 6 of Wilke (2019).
This chapter begins by discussing bar charts. Many of you gravitate toward bar plots. Beware the tendency to overuse them! But if you are going to use them, you should use them well.
List some guidelines/advice Wilke gives about creating bar charts.
Rotating labels can produce an ugly axis guide; a better solution is to swap the x and y axes so that the bars run horizontally.
We need to pay attention to the order in which the bars are arranged.
Grouped bars work when we are primarily interested in specific differences within a particular group. They are not a good choice if we care more about the overall pattern.
Stacking is useful when the sum of the amounts represented by the individual stacked bars is in itself a meaningful amount.
When is it not advised to use a bar chart? Why?
What alternatives to bars are mentioned in this chapter?
What guidance does Wilke give about whether or not to stack bars vs. dodge them (using an offset in Vega-Lite)?
Stacked bars are useful when the sum of the amounts represented by the individual stacked bars is in itself a meaningful amount. Stacking is also appropriate when the individual bars represent counts.
Dodged bars are useful when we need to show a lot of information at once. They work well if we are primarily interested in the differences in the levels of one variable within a particular group.
Recreate Figure 6.3 in Vega-Lite. [CSV]
library(reticulate)
import pandas as pd
import altair as alt
df = pd.read_csv("https://calvin-data304.netlify.app/data/cow-movies.csv")
df["amount_millions"] = df["amount"] / 1_000_000
chart_6_3 = alt.Chart(df, width=400, height=300).mark_bar().encode(
y =alt.Y("title_short:N", title = "", sort = "-x"),
x= alt.X("amount_millions:Q",
sort = 'ascending',
title = "weekend gross (million USD)",
scale=alt.Scale(domain=[0, 80]),
axis=alt.Axis(values=[0, 20, 40, 60, 80])
)
)
chart_6_3
df_8 = pd.read_csv('https://calvin-data304.netlify.app/data/cow-income.csv')
include_race = ["asian", "white", "hispanic", "black"]
df_8_filtered = df_8[df_8["race"].isin(include_race)]
Chart_6_8 = alt.Chart(df_8_filtered,width = 400, height = 300).mark_bar().encode(
x=alt.X("race:N", title="", axis=alt.Axis(labelAngle=0), sort = include_race),
xOffset=alt.XOffset("age:N"),
y=alt.Y("median_income:Q", title="median income (USD)", scale=alt.Scale(domain=[0, 100000]), axis=alt.Axis(values = [20000, 40000, 60000, 80000, 100000],format="$,.0f")),
color=alt.Color("age:N", title="age (yrs)", scale=alt.Scale(scheme="blues"))
)
Chart_6_8
df_9 = pd.read_csv('https://calvin-data304.netlify.app/data/cow-income.csv')
include_race= ["asian", "white", "hispanic", "black"]
df_9_filtered = df_9[df_9["race"].isin(include_race)]
df_9_filtered = df_9_filtered.assign(age=df_9_filtered["age"].str.replace("to", "-"))
Chart_6_9 = alt.Chart(df_9_filtered,width = 300, height = 200).mark_bar().encode(
x = alt.X("age:O",title = "age (years)", axis = alt.Axis(labelAngle = 0)),
y = alt.Y("median_income:Q", title = "median income (USD)", axis = alt.Axis(format = "$,.0f")),
facet = alt.Facet("race:N", title = None)
).configure_facet(columns = 2).resolve_scale(x = "independent", y = "shared")
Chart_6_9
df3 = pd.read_csv('https://calvin-data304.netlify.app/data/cow-gapminder.csv')
df3_2007 = df3[(df3["year"] == 2007) & (df3["continent"]== "Americas")]
Chart_6_11 = alt.Chart(df3_2007,width=400, height=300).mark_point().encode(
x = alt.X("lifeExp:Q",title = "life expectancy (years)",
scale=alt.Scale(domain=[60, 82]),
axis=alt.Axis(values=[60,65, 70, 75,80], grid = True)),
y = alt.Y("country:N",title = "", sort = "-x", axis = alt.Axis(grid = True))
)
Chart_6_11
Figure 6.12 uses bars, which are too long and draw attention away from the data. Figure 6.13 is disordered, so it is hard for it to convey a clear message.
Exercise 2 (A video presentation by Healy)
Exercise 3 (Heat maps)
Figure 6.14
df14 = pd.read_csv('https://calvin-data304.netlify.app/data/cow-internet2.csv')
# take an explicit copy so later changes do not act on a slice of df14
df14_wrangled = df14[df14["year"] > 1993].copy()
# fill in NA
df14_wrangled = df14_wrangled.fillna(0)
# sort order
sorted_countries = df14_wrangled[df14_wrangled["year"] == 2016].sort_values("users", ascending = False)["country"].tolist()
Chart_6_14 = alt.Chart(df14_wrangled, width = 500, height = 400).mark_rect().encode(
x = alt.X("year:O",title = "", axis=alt.Axis(values=[1995,2000, 2005, 2010,2015], labelAngle = 0)),
y = alt.Y("country:N", title = "", axis = alt.Axis(orient="right"), sort = sorted_countries),
color = alt.Color("users:Q", title = "internet users / 100 people", legend=alt.Legend(orient="top", values = [0,25,50,75,100]),
scale = alt.Scale(scheme = 'inferno')))
Chart_6_14
Exercise 4 (Pie charts)
Wilke also notes that pie charts do not allow easy visual comparison of relative proportions.
In addition, Wilke argues that pie charts do not work well when the whole is broken into many pieces, or for visualizing many sets of proportions or time series of proportions.
However, Wilke notes that pie charts clearly visualize the data as proportions of a whole, emphasize simple fractions, and look visually appealing for small datasets.
Alternatives: stacked bars, side-by-side bars.
Stacked bars: Like pie charts, stacked bars can visualize the data as proportions of a whole. In addition, they work well for visualizing many sets of proportions or time series of proportions. However, they do not look visually appealing for small datasets, nor do they emphasize simple fractions.
Side-by-side bars: Like pie charts, side-by-side bars look visually appealing even for small datasets. In addition, they allow visual comparison of relative proportions and work well when the whole is broken into many pieces, which pie charts cannot do. However, they cannot clearly visualize the data as proportions of a whole and cannot visually emphasize simple fractions the way pie charts can.
data = pd.DataFrame({
'party': ['FDP','SPD','CDU/CSU'],
'seats': [39,214,243]
})
Order = ['FDP','SPD','CDU/CSU']
data['order'] = pd.Categorical(data['party'], categories=Order, ordered=True)
base = alt.Chart(data, width = 300, height = 300).mark_arc().encode(
theta = alt.Theta("seats:Q").stack(True),
color = alt.Color("party:N", legend = None, sort = Order),
order = alt.Order("order:O")
)
pie = base.mark_arc(outerRadius=120)
text_chart = base.mark_text(radius = 80, size = 25).encode(
text = "seats:Q",
color = alt.value("white")
)
text_chart2 = base.mark_text(radius = 140, size = 10).encode(
text = "party:N",
color = alt.value("black")
)
Total_chart = pie + text_chart + text_chart2
Total_chart
a. What is the most interesting lesson, guide, or piece of advice Tufte offers you in this chapter?
On page 37, Tufte wrote: “The problem with time-series is that the simple passage of time is not a good explanatory variable: descriptive chronology is not a causal explanation”.
This is interesting because we sometimes attribute a pattern in a time-series graphic to external causes and explanations. In fact, we cannot draw such a conclusion from the graphic alone. Nonetheless, it gives us an idea of the trend and prompts further exploration.
b. Tufte shares some of his favorite graphics in this chapter. Pick one (but not the one about the military advance on and retreat from Russia) and answer the following.
This is an interesting trail graph, and I like how it challenges the assumption that unemployment rate and inflation are inversely related.
Position (X axis): male unemployment rate.
Position (Y axis): Increase in CPI.
Line (connect points): time.
Text: Guides for year.
I think everything in this graph would be possible to implement in Vega-Lite.
It is an example of a relational graphic that links two variables, encouraging and imploring the viewer to assess the possible causal relationship between them. This graphic in particular confronts the commonly held belief that inflation and the unemployment rate are inversely related.
List one or two ideas that you learned in these sections that will change the way you design and create data graphics.
Exercise 2.13 from book:
Step 1: List three things that are not ideal about this graph.
- The guide is unclear. We don’t know what the response and completion rates mean. In addition, the x axis is unclear.
- Completion rates are drawn as bars and response rates as lines. This suggests that completion rate is treated as discrete data, while response rate is treated as a continuous variable.
- The text labels and the y axis guides are redundant. In addition, the orange text labels such as “2.3%” overlap a little with the blue bars and sometimes with the orange lines.
Step 2: For each, describe how you would overcome the given challenges.
- I would change the title from “Response and Completion Rates” to “Response and Completion Rates of ____ from 2017 to 2019” and add a title to the x axis: “Time (measured quarterly)”.
- I would change the bars for completion rate into lines.
- I would label only the beginning and ending values, leaving the rest to the y axis guide.
Step 3:
a. Graphic:
import pandas as pd
import altair as alt
# Data load-in
data = {
"values": [
{"Date": "Q1-2017", "Completion Rate": 0.91, "Response Rate": 0.023},
{"Date": "Q2-2017", "Completion Rate": 0.93, "Response Rate": 0.018},
{"Date": "Q3-2017", "Completion Rate": 0.91, "Response Rate": 0.028},
{"Date": "Q4-2017", "Completion Rate": 0.89, "Response Rate": 0.023},
{"Date": "Q1-2018", "Completion Rate": 0.84, "Response Rate": 0.034},
{"Date": "Q2-2018", "Completion Rate": 0.88, "Response Rate": 0.027},
{"Date": "Q3-2018", "Completion Rate": 0.91, "Response Rate": 0.026},
{"Date": "Q4-2018", "Completion Rate": 0.87, "Response Rate": 0.039},
{"Date": "Q1-2019", "Completion Rate": 0.83, "Response Rate": 0.028}
]
}
df = pd.DataFrame(data["values"])
Order = ["Q1-2017","Q2-2017","Q3-2017","Q4-2017","Q1-2018","Q2-2018","Q3-2018","Q4-2018","Q1-2019"]
# Completion Rate Chart
Completion_chart = alt.Chart(df, width=400, height=300).mark_line(color="blue").encode(
x=alt.X('Date:N', title="Date (measured quarterly)", sort=Order),
y=alt.Y('Completion Rate:Q', title="Completion Rate", axis=alt.Axis(titleColor="blue"))
)
# point
Completion_chart_point = alt.Chart(df, width=400, height=300).mark_point(color="blue",size = 20).encode(
x=alt.X('Date:N', title="Date (measured quarterly)", sort=Order),
y=alt.Y('Completion Rate:Q')
)
# Text
Text_completion = alt.Chart(df[df['Date'].isin(["Q1-2017", "Q1-2019"])]).mark_text(align='center', dy=10, color='blue').encode(
x=alt.X('Date:N', sort=Order),
y=alt.Y('Completion Rate:Q'),
text=alt.Text("Completion Rate:Q", format=".2f")
)
Completion_chart_text = alt.layer(Completion_chart, Text_completion,Completion_chart_point)
# Response Rate Chart
Response_chart = alt.Chart(df, width=400, height=300).mark_line(color="orange").encode(
x=alt.X('Date:N', title="Date (measured quarterly)", sort=Order),
y=alt.Y('Response Rate:Q', title="Response Rate", axis=alt.Axis(titleColor="orange"))
)
# point
Response_chart_point = alt.Chart(df, width=400, height=300).mark_point(color="orange", size = 20).encode(
x=alt.X('Date:N', title="Date (measured quarterly)", sort=Order),
y=alt.Y('Response Rate:Q')
)
# Response Rate Text
Text_response = alt.Chart(df[df['Date'].isin(["Q1-2017", "Q1-2019"])]).mark_text(align='center', dy=10, color='orange').encode(
x=alt.X('Date:N', sort=Order),
y=alt.Y('Response Rate:Q'),
text=alt.Text("Response Rate:Q", format=".3f")
)
Response_chart_text = alt.layer(Response_chart, Text_response, Response_chart_point)
# Layer the two charts on top of each other with independent y-axes
Total_chart = alt.layer(Completion_chart_text, Response_chart_text).resolve_scale(y="independent").properties(
title=alt.TitleParams(
text="Completion and Response Rates of Survey from 2017 to 2019",
anchor="middle",
fontSize=16
)
).configure_axisX(
labelAngle=0
)
Total_chart
b. Identify some ways in which your design was affected by the things you read or the examples you saw in this assignment.
Tufte talked about how time-series graphics are great for richer, more complex, more difficult data. He stressed the efficiency of data graphics. Therefore, I chose to layer the graphics together instead of concatenating them, so that more data can be visualized efficiently in a smaller space. In addition, some graphics Tufte showed have two y axes, one on each side (example: page 15). This shows that it is okay to have two different y axes as long as they are accurately presented.